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Summary. Protein sequence alignments generally 
are constructed with the aid of a "substitution ma- 
trix" that specifies a score for aligning each pair of 
amino acids. Assuming a simple random protein 
model, it can be shown that any such matrix, when 
used for evaluating variable-length local align- 
ments, is implicitly a "log-odds" matrix, with a spe- 
cific probability distribution for amino acid pairs to 
which it is uniquely tailored. Given a model of pro- 
tein evolution from which such distributions may be 
derived, a substitution matrix adapted to detecting 
relationships at any chosen evolutionary distance 
can be constructed. Because in a database search it 
generally is not known a priori what evolutionary 
distances will characterize the similarities found, it 
is necessary to employ an appropriate range of ma- 
trices in order not to overlook potential homologies. 
This paper formalizes this concept by defining a 
scoring system that is sensitive at all detectable ev- 
olutionary distances. The statistical behavior of this 
scoring system is analyzed, and it is shown that for 
a typical protein database search, estimating the 
originally unknown evolutionary distance appropri- 
ate to each alignment costs slightly over two bits of 
information, or somewhat less than a factor of five 
in statistical significance. A much greater cost may 
be incurred, however, if only a single substitution 
matrix, corresponding to the wrong evolutionary 
distance, is employed. 
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Sequence alignment has become a widely used tool 
for the discovery and understanding of functional 
and evolutionary relationships among proteins. The 
earliest work in this area concentrated on "global" 
comparisons, aligning, for example, complete 
globin chains (Needleman and Wunsch 1970; Sell- 
ers 1974; Sankoff and Kruskal 1983). More recently 
the focus has shifted to "local" comparisons in 
which only small regions of the sequences com- 
pared need participate (Smith and Waterman 1981; 
Goad and Kanehisa 1982; Sellers 1984). Because 
proteins of related function often share only such 
isolated regions of similarity, e.g., near an active 
site, the most popular database search programs 
have all relied upon measures of local similarity 
(Lipman and Pearson 1985; Pearson and Lipman 
1988; Altschul et al. 1990). 

The local similarity of two sequences may be 
evaluated in a variety of manners. The simplest of 
these is the length of the longest matching segment, 
perhaps allowing for a specified number or propor- 
tion of mismatches (Arratia et al. 1986; Karlin and 
Ost 1988; Arratia and Waterman 1989). More sen- 
sitive measures take account of relationships among 
the amino acids, typically by assigning a score to 
every aligned pair of residues. Much thought has 
been given to choosing such a "substitution ma- 
trix' ' best able to distinguish biologically meaning- 
ful from chance similarities (McLachlan 1971; Day- 
hoff et al. 1978; Schwartz and Dayhoff 1978; Feng 
et al. 1985; Rao 1987; Risler et al. 1988; Altschul 
1991; States et al. 1991; Gonnet et al. 1992; Heni- 
koff and Henikoff 1992; Jones et al. 1992). Among 
such scoring systems, the most widely used and 
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most closely studied have been the log-odds matri- 
ces based upon the "PAM" model of protein evo- 
lution (Dayhoff et al. 1978; Schwartz and Dayhoff 
1978). Recently it has been argued that any substi- 
tution matrix used for assessing local alignments is 
imphcitly a log-odds matrix, best adapted for dis- 
tinguishing local alignments characterized by spe- 
cific frequencies for aligned amino acid pairs (Kar- 
lin and Altschul 1990; Altschul 1991). Accepting for 
the sake of analysis the PAM evolutionary model, it 
remains the case that when seeking protein similar- 
ities, the degree to which two segments have di- 
verged will not be known a priori. Therefore, unless 
PAM matrices suitable for a range of evolutionary 
distances are used, one may easily miss biologically 
meaningful alignments that could have been distin- 
guished from random noise (Altschul 1991). 

While the importance of using scores tailored for 
different evolutionary distances has been recog- 
nized frequently (Schwartz and Dayhoff 1978; Col- 
lins et al. 1988; Altschul 1991; States et al. 1991; 
Gonnet 1993), the statistical issues presented by 
such multiple tests have not been well studied. The 
basic problem can be understood by imagining us- 
ing a large number of different PAM matrices to 
compare the same two sequences. If only a single 
such matrix were used, the statistical significance of 
the highest-scoring segment pair may be easily 
bounded (Karlin and Altschul 1990). But selecting 
as well the PAM matrix that optimizes the align- 
ment score introduces a new degree of freedom. An 
alignment that would have been statistically signif- 
icant had only a single score matrix been used may 
no longer be surprising. 

In this paper we propose a formal definition for 
an "AU-PAM" scoring system that makes minimal 
a priori assumptions concerning the evolutionary 
distance between the sequences in the alignments 
sought. We study the distribution of optimal values 
of this scoring system when applied to random se- 
quences so that the significance of high-scoring 
alignments can be properly assessed. Finally, we 
describe how the BLAST algorithms (Altschul et al. 
1990) can be modified to accommodate an approx- 
imation of AU-PAM scores, and provide examples 
where such scores are useful for distinguishing bi- 
ologically meaningful similarities. 

Maximal Segment Pairs and Their 
Statistical Significance 

The measure of local sequence similarity we study 
is based upon a substitution matrix which assigns a 
score to every pair of amino acids. Given two pro- 
tein sequences, we define their maximal segment 
pair (MSP) as those two equal-length segments 
which when aligned have maximal aggregate score. 



These two segments may be of any length that max- 
imizes their score. The score of the maximal seg- 
ment pair we take as our measure of local similarity. 

In order to assess how great a score can be ex- 
pected to occur by chance, we need a model of 
chance. We will employ a simple model in which 
the residues of a protein are chosen independently, 
with amino acid / occurring with probability p,. A 
sequence fitting this model will be called a random 
protein sequence. 

It has been argued that, given this random model, 
any substitution matrix useful for defining MSP 
scores is implicitly a log-odds matrix (Karlin and 
Altschul 1990; Karlin et al. 1990; Akschul 1991). In 
other words, if amino acid / aligned with amino acid 
j has score Sij, then there are a specific set of target 
frequencies qy for which the scores can be written 



log 



PiPj 



(1) 



These scores are the ones best suited for distin- 
guishing local alignments in which amino acids / and 
j are aligned with frequency qy (Karlin and Altschul 
1990; Karlin et al. 1990; Altschul 1991; Dembo and 
Karlin 1991). 

A set of scores is changed in no essential way if 
all are multiplied by a positive constant: clearly the 
optimal alignment will not be changed. Such multi- 
plication affects only the base of the logarithm in 
equation (1), not the target frequencies. In the lan- 
guage of information theory, scores that are ex- 
pressed as natural logarithms are said to have the 
units oi nats (Hamming 1986). Because it is easier 
to do mental calculations using powers of 2, it is 
frequently more convenient to use logarithms to the 
base 2, and scores based on such logarithms are 
said to be expressed in bits. The difference between 
these units, like that between feet and meters, is 
simply a constant factor: 1 bit « 0.693 nat. Because 
the formulation of the mathematical concepts in this 
paper is simpler when natural logarithms are em- 
ployed, it will always be assumed below, except 
when otherwise stated, that substitution scores are 
normalized (multiphed by a suitable constant) so as 
to be expressed in nats. Specifying a set of target 
frequencies then completely specifies a set of 
scores. How these frequencies can be chosen will 
be discussed in the next section. 

Given normalized scores, the statistical theory 
for high scoring alignments takes on a particularly 
simple form. When two random sequences are com- 
pared, the number of distinct, or locally optimal 
(Sellers 1984), segment pairs expected to have score 
at least x is well approximated by the formula 
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where A^ is a calculable constant dependent on the 
scoring system, and N is the product of the se- 
quences' lengths (Karlin and Altschul 1990; Karlin 
et al. 1990, Dembo and Karlin 1991). More pre- 
cisely, the MSP scores take on an extreme value 
distribution (Gumbel 1958) whose cumulative distri- 
bution function is given by the formula 



exp[-e 



-\{.x — m) 



(3) 



where the scaling factor X. is 1 , and the character- 
istic value u is given by the formula 



u = \nKN 



The PAM Model of Protein Evolution 



(4) 



Log-odds scores for protein sequence comparison 
were first proposed by Dayhoff et al. (1978), though 
in the context of global sequence alignment. In or- 
der to calculate an appropriate set of target frequen- 
cies, they constructed from a study of homologous 
protein families the so-called PAM (for point- 
accepted mutation) model of protein evolution. The 
model presents protein evolution as a stochastic 
process in which at each time a given amino acid 
residue has certain fixed probabilities of mutating 
into any of the other residues. One PAM is defined 
as the amount of evolutionary change required for 
an expected 1% of all residues (suitably weighted by 
their relative frequencies) to mutate. The manner in 
which the particular numbers in the PAM model 
were derived has been criticized (Wilbur 1985), and 
a great deal of additional data has accumulated 
since the model was proposed. Therefore it is cer- 
tainly possible that better numbers than those de- 
rived by Dayhoff et al. (1978) can be found. This 
paper discusses a general manner in which any 
PAM-like model may be generalized, and for this 
purpose the values used by Dayhoff et al. (1978) will 
suffice. In fact, these numbers have proved very 
successful over the years (Feng et al. 1985). 

Given the PAM model, it is easy to calculate the 
frequency qy with which any pair of amino acids are 
expected to correspond when two homologous pro- 
teins separated by a particular degree of evolution- 
ary change are properly aligned. These target fre- 
quencies are then used to calculate "PAM scores" 
for the chosen evolutionary distance, using equa- 
tion (1) above. For years the most popular scoring 
systems for protein sequence comparison were 
based upon PAM-250 scores (Dayhoff et al. 1978), 
i.e., scores derived from the target frequencies cor- 
responding to 250 PAMs of protein evolution. More 
recently it has been argued that for database 
searches PAM- 120 scores are generally more effec- 
tive, and that in fact several different PAM matrices 



should be employed (Altschul 1991). Below we ex- 
tend this line of reasoning to propose an "AU- 
PAM" scoring system and to study its associated 
statistics. 

The All-PAM Scoring System 

A given PAM matrix is adapted to finding segments 
that have diverged by a specific amount of evolu- 
tionary change. Since in general it is not known 
what evolutionary distance is appropriate to an 
alignment that has yet to be found, it is important to 
use a variety of PAM matrices when comparing two 
sequences or searching a database with a query se- 
quence (Collins et al. 1988). It has been argued that 
for many purposes two or three different PAM ma- 
trices will suffice (Altschul 1991). Nevertheless, 
one may imagine trying all possible PAM matrices 
in order to find that one which will maximize an 
alignment's score. This in effect defines a new scor- 
ing system, but one with certain subtle aspects that 
will be considered below. 

Given a set of scores for PAM distance d and a 
pair of random protein sequences, the number of 
distinct segment pairs expected by chance to have a 
score of at least x is given by formula (2) above, 
where Kj (the K corresponding to PAM distance d) 
is readily calculated (KarUn and Altschul 1990). The 
value of K^ varies with PAM distance, decreasing 
as d increases. Therefore, if one uses two distinct 
PAM matrices to compare the same random protein 
sequences, the matrix of lower PAM distance is ex- 
pected to yield more high- scoring segment pairs. 
Assuming we have no reason to favor one PAM 
distance to another, we should correct for this sta- 
tistical effect by adjusting the PAM scores so that 
they all have the same asymptotic random distribu- 
tion. This can be accomplished easily by subtract- 
ing In Kj from the score of any segment pair. In 
other words, if 5^ is the score for a segment pair 
using a FAM-d matrix, define the corrected score as 



^ d ~ ^d 



InK^ 



(5) 



Notice that this does not change the scores for 
aUgning pairs of amino acids. Rather, every align- 
ment can be thought of as "starting" with the score 
-In K^. Such corrected scores have identical 
asymptotic extreme value distributions, with char- 
acteristic value In A^ and scale factor 1 . 

One difficulty with the approach just described is 
that since K^ gets arbitrarily small as d grows, the 
corrected score for great PAM distances may begin 
at arbitrarily large values. This is not really a prob- 
lem because, as is argued in Appendix A, for a 
given finite-sized comparison there is a greatest 
PAM distance it makes sense to consider. Letting D 
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be this largest PAM distance, we are in a position to 
define the All-PAM score A of an alignment: 



A = max S'd = max (Sd - In K4) 



(6) 



ds:D 



d'SD 



While the utilization of a range of PAM matrices to 
search for local protein similarities has been advo- 
cated previously (Collins et al. 1988; Altschul 1991; 
Gonnet 1993), equation (6) introduces the In K^ cor- 
rection, as well as the formal definition of an All- 
PAM similarity measure. We proceed below to 
study the statistical behavior of this function. 
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Fig. 1. The efficiency of PAM-d scores (d equals 5, 30, 70, 120, 
180, and 250) as a function of the actual PAM distance of the 
sequences being compared. PAM-d scores have their maximum 
efficiency (1.0) when the actual PAM distance equals d. 



An Extreme Value Theory for All-PAM Scores 

As discussed above, each corrected PAM score S' ^ 
has an extreme value distribution with characteris- 
tic value H = In N. No formal theory for the statis- 
tics of the All-PAM scores defined in equation (6) 
has yet been established. Nevertheless, analogy 
with other related scores (Smith et al. 1985; Arratia 
et al. 1986; Altschul and Erickson 1986; Arratia et 
al. 1988; Arratia and Waterman 1989; Karlin and 
Altschul 1990; Mott 1992) strongly suggests that 
they also should follow an extreme value distribu- 
tion. The key question is to what extent optimizing 
the score by allowing the PAM distance to vary 
increases the characteristic value of the optimal lo- 
cal alignment. 

One way to approach this problem is to imagine 
that optimizing over PAM distances between and 
D is essentially equivalent to optimizing over an 
effective number E of independent scoring systems. 
It is argued in Appendix B that, whatever the pre- 
cise form of the PAM model and whatever the 
c hoice of D, in the limit E should be proportional to 
Vlnlv. Optimizing over E independent scoring sys- 
tems is equivalent to multiplying the effective 
search space by E; it should therefore increase the 
characteristic value of the optimal alignment by In 
E. This implies that a formula for the characteristic 
value u of All-PAM scores should take the form 



1 

M = In A^ -t- - In In A^ + C 



(7) 



where the constant C will depend upon the PAM 
model and the range of PAM distances over which 
the All-PAM score is defined. (See Appendix B.) 

Equation (7) has been established analytically for 
certain simple cases (S. Karlin, personal communi- 
cation), but in the present context is only an hy- 
pothesis, and therefore can benefit from empirical 
support. Thus we have tested this model by random 
simulation, as described in the following section. 



A Monte Carlo Simulation 

While we wish to study the behavior of All-PAM 
scores, no efficient algorithm for truly optimizing 
VAM-d scores over a continuum of possible PAM 
distances d has been described. We have therefore 
approximated All-PAM scores by maximizing 5'j 
over a relatively small set of selected distances. An 
appropriate set of distances can be chosen by con- 
sidering the efficiency of various PAM matrices as a 
function of actual PAM distance (Altschul 1991; 
States et al. 1991). The efficiency of a PAM-d ma- 
trix at PAM distance x is defined as the ratio of the 
expected PAM-d to the expected VKM-x score, for 
a proper alignment of sequences actually separated 
by PAM distance x. Because the highest expected 
score at this distance is always the PAM-x score, 
the efficiency of PAM-d scores is always less than 
or equal to 1.0. Intuitively, it measures what pro- 
portion of the potential information available at 
PAM distance x is captured by using a PAM-d ma- 
trix. 

The efficiency curves for PAM-d scores, with d 
equal to 5, 30, 70, 120, 180, and 250, are graphed in 
Fig. 1. It is evident that at virtually all PAM dis- 
tances in the reaUstically detectable range to 290 
(Appendix A), the best of these six scoring systems 
is at least 98% efficient. In the Monte Carlo simu- 
lation discussed below, All-PAM scores were there- 
fore approximated by maximizing over PAM scores 
for these six discrete distances. 

We modeled protein sequences by strings of in- 
dependently chosen amino acids, selected with the 
frequencies used in the PAM model of Dayhoff et 
al. (1978). For a given pair of lengths m and n (yield- 
ing a search space of size A^ = mn), we generated 
two random protein sequences and found their ap- 
proximate All-PAM score, as described above. Us- 
ing the method of moments (Altschul and Erickson 
1986), we estimated u and X for the extreme value 
distribution corresponding to lengths m and n from 
20,000 such random scores. For m fixed at 248 and 
varying n, these estimates u and X are given in Table 
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Table 1. The characteristic value of AU-PAM MSP scores as a 
function of search space size'* 



InlnAf 



M - IniV 



248 


2.40 


11.999 


0.972 


0.97 


309 


2.42 


12.250 


1.003 


0.96 


387 


2.44 


12.495 


1.023 


0.98 


489 


2.46 


12.731 


1.025 


0.97 


619 


2.48 


12.979 


1.037 


0.97 


788 


2.50 


13.237 


1.054 


0.97 


1,007 


2.52 


13.490 


1.062 


0.98 


1,295 


2.54 


13.746 


1.066 


0.95 


1,673 


2.56 


14.011 


1.075 


0.97 


2,172 


2.58 


14.296 


1,099 


0.97 


2,836 


2.60 


14.565 


1.101 


0.97 


3,723 


2.62 


14.858 


1.122 


0.97 



" Two random protein sequences of lengths m = 248 and n were 
chosen, and their AU-PAM MSP score was approximated by 
maximizing S' over PAM distances 5, 30, 70, 120, 180, and 250. 
From 20,000 such scores, estimates u and \ of the characteristic 
value and scale factor for an extreme value distribution were 
obtained by the method of moments (Altschul and Erickson 
1986). Values of S, m - In Af, and X are reported as a function of 
In In N, where JV = mn. The standard error for each is 0.01 



1, and M - In N is plotted in Fig. 2 as a function of 
In In N. The standard error for each u and \ is 0.01 . 

Equation (7) implies that the quantity u - \n N 
should be a linear function of In In N, with slope 1/2. 
Except for the smallest N, Fig. 2 shows that the 
data fit the theory remarkably well. Linear regres- 
sion on the points with In In Af ^ 2,42 yields a slope 
of 0.55 ± 0.05. The line plotted in Fig. 2,u-\nN 
= 0.5 In In iV - 0.199, is the best-fitting one with 
slope 1/2. Similar results were obtained for other 
values of m (data not shown). 

Because In In N is such a slowly growing func- 
tion, it is not easy to detect the hypothesized devi- 
ation of u from In AT with increasing N. From one 
point to the next in Fig. 2, the y-coordinate is ex- 
pected to increase by only 0.01, but even with 
20,000 random trials per point this is the standard 
error for y. Nevertheless, over the domain investi- 
gated, the slope of the line becomes well defined. 
Thus, given our domain is appropriate for detecting 
asymptotic effects, and given the correct formula 
for u does contain a log-log term, we can confi- 
dently conclude that its coefficient is much closer to 
1/2 than it is to either 1 or 0. Appendix B , of course, 
argues that the coefficient is exactly 1/2. 

Of less relevance to the main thrust of this paper, 
but nevertheless of some interest, is the idea that 
while in the infinite limit the scale factor X. for the 
extreme value distribution of All-PAM scores 
should approach 1 , for finite N a better estimate of 
\ is 1 - (l/2u). This is discussed briefly in Appen- 
dix B and is supported by the Monte Carlo esti- 
mates of X recorded in Table 1 . 

A closer approximation to true AU-PAM scores 
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Fig. 2, The difference between the estimated characteristic ex- 
treme value u of All-PAM scores and In N, as a function of In In 
N, where N is the size of the search space, u was estimated by 
the method of moments (Altschul and Erickson 1986) from 
20,000 random values. Error bars represent the standard error of 
the estimates. The theory represented by equation (7) implies 
these points should lie on a line with slope 1/2. The best such line 
for fitting those points with ordinate greater than 2.40 is plotted. 



than the one studied above would yield slightly 
higher scores, and thus a slightly higher value of C 
than the -0.20 found above. Random simulation as 
well as theoretical considerations (Appendix B) 
suggest that for true All-PAM scores over the range 
to 290, C would be increased by about 0.15 to 
-0.05. 

Practical Approximations to All-PAM Scores 

It is instructive here to return to the motivation for 
equation (7): the idea that the effective number E of 
independent PAM matrices embodied by All-PAM 
scores is proportional to Vln N. The fact that, for 
the All-PAM range studied, C in (7) is about -0.05, 
means that E can be well approximated as 



E -= 0.95 Vinlv 



(8) 



From equation (8), we calculate that when two pro- 
teins of average length are compared, N ~ 60,000 
and E = 3.2. Similarly, when a protein of average 
length is compared with a comprehensive protein 
sequence database of typical current size, N ~ 10^° 
and E ~ 4.6. These numbers are important for de- 
ciding when the theory developed here has fruitful 
appUcation. For instance, if All-PAM scores for 
protein database searches are approximated by 
taking the maximum of £* < 5 distinct PAM-cf ma- 
trices, then treating these matrices as independent, 
i.e., estimating u as In {E*N), is conservative but 
still superior to equation (8). On the other hand, if 
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All-PAM scores are estimated by taking the maxi- 
mum of five or more corrected PAM-J scores, 
equation (8) provides a tighter upper bound on the 
effective number of independent comparisons actu- 
ally performed. As described above, 98% efficiency 
can be achieved over the detectable PAM range by 
using the six PAM matrices 5, 30, 70, 120, 180, and 
250 (Fig. 1). For typical database searches these, or 
any greater number of matrices, count as fewer than 
five independent scoring systems. It should be 
noted that nothing is wrong with optimizing PAM 
distance "after the fact." In other words, given an 
alignment found by any method, one may vary d at 
will in order to maximize the aUgnment's S' ^ score 
and still obtain a valid estimate of this All-PAM 
score's statistical significance. 

For database searching, All-PAM MSP scores 
can be calculated in reasonable time by specially 
designed microchips (Chow et al. 1991; Hughey 
1991 ; White et al. 1991), by parallel architecture ma- 
chines (Coulson et al. 1987), or by fast heuristic 
approximation algorithms such as BLAST (Altschul 
et al. 1990). For the last of these, running the same 
search many times using a large number of different 
PAM matrices can result in an unacceptable de- 
crease in speed. This can be avoided by using at 
first only a small number of widely spaced PAM 
matrices, e.g., PAM-30, -120, and -250. These three 
matrices alone should be able to locate any segment 
pair with potentially significant All-PAM score. 
Once the relevant alignments have been located, 
they may be rescored (with possible attendant trim- 
ming or extension) by using a more tightly spaced 
set of matrices. This strategy allows All-PAM 
scores to be approximated closely in only a small 
multiple of the time required for a single PAM-t/ 
score. 

The use of All-PAM scores implies an ignorance 
concerning the evolutionary distance inherent in the 
similarities sought; the data are used to select an 
optimal distance. The estimation of this unknown 
parameter must be paid for with lost statistical sig- 
nificance. As discussed above, for typical database 
searches, the choice of an optimal PAM distance 
from the range to 290 costs a factor of about 4.6 in 
significance, or about 2.2 bits of information. The 
single set of PAM- 120 scores is superior to such 
All-PAM scores if one knows for certain that the 
similarities sought have PAM distances between 70 
and 170. The problem arises for similarities with 
impUcit PAM distance outside this range. PAM-120 
scores are, respectively, only 82% and 50% efficient 
for alignments with true PAM distances 30 and 250, 
Thus significant alignments at these two distances, 
with 36 bits of potential (i.e., All-PAM) informa- 
tion, are left, respectively, with PAM-120 scores of 
only 30 and 18 bits. Employing All-PAM scores can 



therefore be seen as a decision always to sacrifice 
about two bits of information so as to avoid occa- 
sional losses of much greater magnitude. 

A Biological Example 

The value of All-PAM scores is perhaps best ob- 
served in a database search employing as query se- 
quence a member of a large and diverse protein 
superfamily. For such a sequence, proteins related 
at a wide range of PAM distances are likely simul- 
taneously to be present. All-PAM scores are of use, 
however, whenever the relationships sought are of 
unknown implicit distance. 

We used the BLAST algorithm (Altschul et al. 
1990) to search the PIR protein sequence database 
(Release 31; Barker et al. 1990) with soybean leghe- 
moglobin c 1 (PIR code GPSYCl; Hyldig-Nielsen et 
al. 1982), using amino acid substitution matrices for 
PAM distances 5, 30, 70, 120, 180, and 250. From 
these searches many proteins exhibited statistically 
significant similarities to the query. Alignments il- 
lustrating three such similarities are shown in Fig. 
3, along with the All-PAM score for each rounded 
to the nearest bit; the PAM matrix from which the 
All-PAM score derives is indicated in brackets. In 
the context of the search performed A^ «= 1.5 x 10^; 
from the formulas developed above one may con- 
clude that an All-PAM score of 37 bits or greater is 
statistically significant {P value « 0.05). All three 
alignments shown in Fig. 3 represent true homolo- 
gies and meet this significance criterion. 

How do these three homologous pairs of se- 
quence fare under the individual scoring systems 
whose maximum constitutes the All-PAM score? 
For a database search with A'^ as before, and using a 
single substitution matrix, a corrected PAM-J score 
of 35 bits is statistically significant; as discussed 
above, about two fewer bits are required than for 
All-PAM statistics. Table 2 records, for each 
PAM-J matrix employed, S' ^ for the three homol- 
ogous sequence pairs. It is evident that at no indi- 
vidual distance d do all three alignments simul- 
taneously appear significant. For instance, if 
PAM-120 scores are used, the alignments involving 
the soybean leghemoglobin fragment (PIR code 
B32711; Stougaard et al. 1987) and the bacterial he- 
moglobin (PIR code GGZLB; Wakabayashi et al. 
1986) both appear highly significant, but the long 
and weak alignment involving sea cucumber globin 
(PIR code S 15979; Mauri et al. 1991) is missed. 

In most database searches, homologies at such a 
range of evolutionary distances will not be found, 
and it will perhaps appear "after the fact" that a 
single specific VKM-d matrix was the correct one to 
use. The challenge, however, is to select this matrix 
before the search is performed. As argued earlier 
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(a) All-PAM score: 48 bits [PAM-30] 



P-value : 4 x 10"' 



GPSYCl 3 FTEKQEALVSSSFEAFKANIP 23 

FT Q ALV SS EAFK N P 
B32711 2 FTAQQDALVGSSYEAFKQNLP 22 



(b) All-PAM score: 41 bits [PAM-120] 



P-value: 0.004 



GPSYCl 92 HAQKAVTDPQFVWKEALLKTIKEAVGGNWSDELSSAWEVAYDELA 137 

H Q V V LL IKE G D AW AY A 

GGZLB 85 HCQAGVAAAHYPIVGQELLGAIKEVLGDAATDDILDAWGKAYGVIA 130 



(c) All-PAM score: 37 bits [PAM-250] 



P-value: 0.05 



GPSYCl 
315979 
GPSYCl 



47 LANGVDPTNPKITGHAEKLFAIVRDSAGQLKTNGTWADAALVSIHA 93 

L T HAAIi T ALH 

59 LSPAELRTSRQMHAHAIRVSALMTTYIDEMDTEVLPELLATLTRTHD 105 



94 QKAVTDPQFVWKEALLKTIKEAVGGNWSDELSSAWEVAYDELAAAI 140 
V L IK G AW 

S15979 106 KNHVGKKNYDLFGKVLMEAIKAELGVGFTKQVHDAWAKTFAIVQGVL 152 



Fig. 3. The AU-PAM maximal segment 
pairs for comparisons of soybean 
leghemoglobin c 1 (PIR code GPSYCl) 
with (a) a soybean leghemoglobin fragment 
(PIR code B32711); (b) a bacterial 
hemoglobin (PIR code GGZLB); and (c) a 
sea cucumber globin (PIR code SI 5979). 
Identical residues are echoed on the 
central line of each alignment. P values are 
based on All-PAM statistics and a search 
space size of Af = 1.5 x 10^. 



Table 2. Corrected PAM-d scores for optimal local alignments 
of soybean leghemoglobin c 1 (PIR code GPSYCl) with various 
other protein sequences" 



PIR 


( 


Corrected VAU-d 


score (in bits) for d equal to 


code 


5 


30 


70 


120 


180 


250 


B32711 


36 


48 


46 


40 


33 


27 


GGZLB 


16 


18 


37 


41 


40 


35 


S 15979 


17 


18 


19 


29 


35 


37 



" B32711, soybean leghemoglobin fragment; GGZLB, bacterial 
hemoglobin; S15979, sea cucumber globin 



(Altschul 1991), the PAM-120 matrix is probably the 
best single choice for general purposes, but as seen 
here it can have its failings. The more conservative 
course therefore is to employ All-PAM scores for 
the realistically detectable range of evolutionary 
distances. 



Discussion and Conclusion 

The idea of seeking biologically interesting similar- 
ities by using a range of different PAM matrices was 
first investigated, in the global alignment context, 
by the originators of the PAM model of molecular 
evolution (Schwartz and Dayhoff 1978). This pro- 
posal has been frequently advocated since (Collins 
et al. 1988; Altschul 1991; States et al. 1991; Gonnet 
1993). In this paper we have attempted to give the 
idea a formal embodiment in the form of All-PAM 
scores and to study the statistics of this measure of 



sequence similarity. The study has yielded several 
conclusions. If one does not know a priori the ap- 
propriate PAM distance for the similarities sought, 
there is a price for estimating this parameter from 
the data. In the context of a typical protein se- 
quence database search, choosing an optimal PAM 
distance in the range 0-290 costs a factor of about 
4.6 in statistical significance (2.2 bits). If one is un- 
willing to pay this price — for example, if one uses 
only PAM-120 scores — one may gain slightly for the 
bulk of similarities in the PAM-70 to PAM- 170 
range, but those far outside this range may be 
missed completely. 

Some mention should be made here of the "non- 
Unear similarity functions" proposed earlier by the 
author (Altschul and Erickson 1986). The motiva- 
tion for these functions was similar in spirit to that 
for the All-PAM scores studied above. However, 
both theory and experience now suggest that All- 
PAM scores are superior for the purpose of reveal- 
ing biologically significant similarities. It is worth 
noting that statistical studies of the "nonlinear sim- 
ilarity functions" revealed the presence of a log-log 
term in the growth of their characteristic values 
(Altschul and Erickson 1986; Altschul and Erickson 
1988; Waterman and Gordon 1990). It is likely that 
an analysis such as that outlined here could be ap- 
plied successfully to those functions as well. 

We have discussed All-PAM scores only in the 
context of local alignments lacking gaps. Unfortu- 
nately, no good statistical theory yet exists to per- 
mit the relaxation of this restriction. Certain results 
suggest, however, that the general spirit of our anal- 
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ysis should apply as well to alignments with gaps 
(Smith et al. 1985; Waterman et al. 1987; Mott 
1992). Even when gaps are allowed, however, it 
should be understood that alignment scores based 
on simple substitution matrices, such as those stud- 
ied here, exclude many potential sources of infor- 
mation concerning biological relatedness. If several 
related sequences are available to query a database, 
rather than just a single one, more general weight 
matrices or "profiles" can often be of use (Taylor 
1986; Gribskov et al. 1987; Patthy 1987). Even in 
the context of pairwise alignments, complex scoring 
systems may be able to extract more evidence for 
biological relatedness than can simple substitution 
matrices (Argos 1987). However, algorithms for op- 
timizing such measures can require considerably 
more computation time. The heuristic BLAST pro- 
grams are able to search typical protein sequence 
databases for optimal AU-PAM alignments in a mat- 
ter of minutes (Altschul et al. 1990), while when run 
on a similar machine, programs based on more com- 
plicated measures may require as much as 2 weeks 
(Vogt and Argos 1992). 

While the analysis in this paper has been tied to 
the PAM model of protein evolution, most of the 
ideas developed should apply to any singly param- 
etrized set of scores used either for sequence com- 
parison (e.g., the recently developed system of 
Henikoff and Henikoff 1992) or for "single se- 
quence analysis" (Kariin and Altschul 1990). Such 
parametrized scores arise naturally from an evolu- 
tionary model, but they also occur in other con- 
texts. For example, one may seek regions of a pro- 
tein rich in a given residue type (Kariin et al. 1991), 
but not know a priori what target frequency to ex- 
pect; a parametrized range of scores is then appro- 
priate. 

In summary, the theory developed in this paper 
elucidates some of the choices implicit in the selec- 
tion of a specific amino acid substitution matrix, or 
of AU-PAM scores. It is hoped that this will lead to 
better informed selection of scoring systems for 
macromolecular sequence comparison, and there- 
fore to more sensitive detection of biologically in- 
teresting similarities. While these ideas have not 
been shown to extend beyond the particular context 
in which they have been presented, their general 
spirit should find wider application. 



background noise. This intuition is captured quan- 
titatively by the relative entropy H^ corresponding 
to PAM distance d (Altschul 1991). In short, the 
expected PAM-c/ score per properly aligned residue 
pair is given by the formula 



Hd = Zj 1u ^^ 



3L 
PiPj 



(9) 



where the q^j are the PAM-6? target frequencies 
(Altschul 1991). High-scoring random segment pairs 
will also be characterized by these frequencies 
(Kariin and Altschul 1990; Kariin et al. 1990; 
Dembo and Kariin 1991), and thus by the relative 
entropy H^. Employing this quantity, a rough anal- 
ysis of the greatest PAM distance D detectable in a 
comparison of two sequences of length m is possi- 
ble. 

For large d, the correction constant K^ employed 
above approaches H^ (S. Kariin, personal commu- 
nication). Therefore two sequences of length m that 
are related at PAM distance d, have an expected S'^ 
given by 



mH^ - In iC^ = mHn ~ In H^ 



(10) 



Since the characteristic maximal S'^ expected to 
occur by chance from the comparison of two such 
sequences is In A' = In m^, we require that 



mHj - In //^ > In m^ 



(11) 



if the score for the true relationship is to rise above 
background noise. Elementary calculus shows that 
the left-hand side of (11) decreases monotonically 
with decreasing H^, when H^ > II m. (Beyond this 
point, arbitrarily large but misleading scores S'^ can 
be achieved simply by choosing d large, making H^ 
arbitrarily small. It can be shown, however, that the 
correction term —lnK^~ - In H^ in the definition 
of 5'rfis valid only in the limit of large m and that Hj 
should therefore always be kept larger than 1/m.) 
Simple substitution shows that for H^ = (In rri)lm, 
the bound of inequality (11) is already violated. We 
will therefore take the PAM distance D that yields 



Hn 



In m 



m 



(12) 



Appendix A: Approximate Upper Bounds on 
Detectable PAM Distances 

The greater the PAM distance by which two seg- 
ments have diverged, the less information each 
aligned pair of residues carries, and the longer the 
segment pair need be for its signal to rise above 



as an effective upper bound on detectable PAM dis- 
tances. Notice that if two sequences of unequal 
lengths are compared, and m is the length of the 
shorter sequence, the bound of (12) is still conser- 
vative. Using this equation for a typical protein 
length of m = 250, we get H^ = 0.022 nats, which 
corresponds to a PAM distance D ~ 680. 
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Relative 

Entropy 

(bits) 




Criticai 
Length 
(30 bits) 



100 200 

PAM distance 

Fig. 4. The relative entropy and the length of the shortest align- 
ment to attain significance, as a function of PAM distance. Rel- 
ative entropy decreases with increasing PAM distance, is ex- 
pressed in bits, and is measured by the scale to the left of the 
graph. Critical length — the length in residues of the shortest 
alignment expected to achieve a score of 30 bits — increases with 
increasing PAM distance and is measured by the scale on the 
right of the graph. 



The foregoing analysis assumes that a homology 
stretching the entire length of the protein is present, 
and of course that it does not involve gaps. In prac- 
tice, it is rare that regions of such length are con- 
served over great evolutionary distances. It is much 
more typical for shorter regions, crucial in some 
way to function, to be conserved, and for such re- 
gions only smaller PAM distances will be detect- 
able. For typical database searches an uncorrected 
MSP score of at least 30 bits generally is required to 
distinguish a protein similarity from background 
noise (Altschul 1991). For any PAM distance d, the 
relative entropy H^ of equation (9) allows us to an- 
alyze how long an MSP need be to achieve such a 
score. In Fig. 4 is graphed H^ (expressed in bits) as 
a function of PAM distance, along with the mini- 
mum alignment length required on average to attain 
30 bits of information. It can be seen that by PAM 
distance 290, an MSP needs to be over 100 residue 
pairs long to achieve significance. While as de- 
scribed above greater PAM distances theoretically 
are detectable, this is an effective upper bound for 
real protein database searches, and the one we em- 
ploy in this paper. 



Appendix B: The Characteristic Extreme Value of 
AU-PAM Scores 

Comparing two random proteins of lengths m and n, 
let Ap and A^ be the maximal segment pairs at the 
two PAM distances c and d. As described in the 
text, the corrected scores for these alignments 
should have identical asymptotic extreme value dis- 
tributions with characteristic value In N, where N 



= mn. For c and d very different, these alignments 
should be essentially independent. If 5 is the max- 
imum of their scores, 5 should then be extreme 
value distributed, with characteristic value In 2N 
(Gumbel 1958). On the other hand, if c = d,S triv- 
ially has characteristic value In N. For c close but 
not identical to d, we expect an intermediate case: S 
should be extreme value distributed with character- 
istic value In EN for some number E between 1 
and 2. 

Maximizing a continuous range of corrected 
PAM-d scores, with d between and D, confronts 
us with an analogous situation. In a small neighbor- 
hood of a given d the optimal alignments are all 
correlated, so their maximum score should not dif- 
fer much from S'^. A large range of d, however, 
should encompass a number of essentially indepen- 
dent aUgnments. Maximizing S'^ should again be 
equivalent to maximizing over an effective number 
E of independent scoring systems. It is hypothe- 
sized here that E should not b e fixed but should 
grow proportionally with Vlnlv. More specifically, 
for a continuous range of PAM distance indexed by 
the parameter x, it is hypothesized that in the limit 
of large A^, E is given by the formula 



E = ! V(J^ In I^/IttH^ dx 
= Vhilv / VjJl-uH^ dx 



(13) 



where H^ is the relative entropy of PAM-x scores 
defined in (9), J^ is the Fisher information (Fisher 
1925) of the target frequency distribution at distance 
X, and the integration is of course performed over 
the relevant range of PAM distances. The motiva- 
tion for this equation will be outlined below. Here 
we merely observe that equation (7) in the text fol- 
lows from equation (13), with C given by 



C = In / VjJhdT^ dx 



(14) 



We also note that this formula for C is invariant, as 
it must be, under any alternative parametrization of 
the chosen range of PAM distance. 

This hypothesis embodied in equation (13) origi- 
nates from the idea that for a continuous, parame- 
teized set of correlated random variables (in this 
case the alignments AJ there should be some quan- 
tity that behaves like a "density" of independent 
random variables. The most suitable candidate for 
this quantity derives from the Fisher information J^ 
(Fisher 1925) for the target frequency distribution q^ 
at PAM distance x. In the limit of large N, an opti- 
mal local alignment for this distance is expected to 
have length (In N)IH^, and its aligned amino acid 
pairs are expected to occur with the target frequen- 
cies q^ (Karlin and Altschul 1990; KarUn et al. 1990; 
Dembo and KarUn 1991). Since the Fisher informa- 
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tion for n independent identically distributed (i.i.d.) 
samples of a random variable is n times the Fisher 
information for an individual sample (Cover and 
Thomas 1991), the Fisher information for A^ should 
approach {J^ In N)IH^. Intuitively, this quantity is a 
measure of the rate at which the alignments A^ be- 
come independent of one another. While the Fisher 
information should correlate with the sought "den- 
sity" of independent random variables, a scaling 
argument shows that only its square root provides a 
consistent measure under any change in parametri- 
zation. (Alternatively, the second derivative of the 
efficiency curves shown in Fig. 1, at their maxi- 
mum, can be shown to equal -JJH^. Since the 
maximum score is always approximately In N, the 
"density of curves" required to lose no more than e 
bits of information is therefore proportional to 
V(/^ In AO//fJ. Short of the factor V2^ in the de- 
nominator, this motivates equation (13). This final 
factor can be derived from the limiting case that an 
isolated PAM matrix should count as one indepen- 
dent random variable. 

The arguments above in support of equations (13) 
and (14) are completely intuitive and nonrigorous. 
Nevertheless, they help motivate the 1/2 In In N 
term in equation (7), which can be established for 
certain simple cases (S. Karlin, personal communi- 
cation). The arguments generaUze naturally to sets 
of scores parametrized by k independent variables, 
for which they imply a kl2 In In TV growth term in 
equation (7); this also can be estabhshed rigorously 
in some simple cases (S. Karlin, personal commu- 
nication). 

Using equation (14) to estimate C for the effec- 
tive PAM range 0-290 discussed in Appendix A 
yields C ~ -0.05. This is slightly larger than the 
value of C = -0.20 found by the Monte Carlo sim- 
ulation in the text. That simulation, however, is 
based upon an approximation to All-PAM scores 
using matrices for the six discrete PAM distances 5, 
30, 70, 120, 180, and 250. True All-PAM scores 
would have yielded a somewhat greater value for C. 
While a rigorous theory is perhaps unlikely to val- 
idate equation (14) in detail, its success here sug- 
gests it may nevertheless have a place as a useful 
approximation. 

In the limit of large N we expect All-PAM MSP 
scores to approach an extreme value distribution 
with characteristic value u given by equations (7) 
and (14), and scale factor A, = 1. It should be noted, 
however, that for finite A^ a small correction to X is 
appropriate. This derives from the fact that, inde- 
pendent of A'^, a higher All-PAM score implies a 
greater number of "effective independent trials." 
Properly accounting for this suggests that for All- 
PAM scores X should be approximately 1 - (1/2m), 
which of course approaches I in the limit of large A'^. 



The Monte Carlo estimation of X. described in the 
text and recorded in Table 1 supports this correc- 
tion. 
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